The importance of multimedia information retrieval (MIR) is highlighted
by the vast amount of information on the internet. Images, audio, video,
and text are all forms of raw multimedia data. It is very challenging to
represent a concept of human perception in a form that machine-level
language can grasp (the semantic gap of MIR).


This paper aims to improve the information retrieval model that retrieves
data from multimedia. This is done by using a variety of algorithms that go
through training and testing to build the model. The first algorithm retrieves
text based on the nature of the query language, using the vector space model
(VSM) and latent semantic indexing (LSI). The second uses curvelet
decomposition and statistical parameters such as mean, standard deviation, and
signal energy to retrieve images. Additionally, a method based on the discrete
wavelet transform (DWT) and signal characteristics is used to retrieve audio
signals. Finally, a neural network is modeled and trained on a collection of
different multimedia images. The learned features are used to present an
effective multimedia retrieval system that operates on a large set of
multi-modal datasets.

1. INTRODUCTION

A huge amount of information is available on the web, which users can utilize for creating and storing
images. This has created the need for ways of managing and searching those images. Therefore, finding
effective multimedia retrieval approaches has become an important field of interest for scholars. A multimedia
retrieval approach is a system for the search and retrieval of multimedia objects (texts, images, sounds, and
videos) from a large database of digital libraries [1]. The field of information retrieval (IR) is not new: early
IR systems used simple word-matching tools on small texts. Today, because of the large volume and the very
different forms of the available information sources, there is a need for more efficient retrieval techniques
that retrieve only the relevant part of the information [2]. People need information frequently, and this
information must be stored in the physical storage devices of computers. Many algorithms are used to
retrieve information [3]. Multimedia information retrieval (MIR) systems deal with different types of media
(text, image, audio, and video). Although the media differ, they share a common factor: text. To illustrate
this point, the user can search for any information by entering some keywords in the search field of a search
engine such as Google [4].


Journal homepage: http://ijai.iaescore.com 


Int J Artif Intell ISSN: 2252-8938


2. MULTIMEDIA INFORMATION RETRIEVAL 

MIR is an area of computer science that aims to extract semantic information from multimedia data
sources. These include directly perceivable media such as video, audio, and images; indirectly perceivable
sources such as text, bio-signals, and semantic descriptions; and non-perceivable sources such as
bio-information and stock prices. An MIR system must be able to represent and store multimedia elements in
such a way that they can be retrieved quickly [3]. The system should therefore be able to deal with different
kinds of media and with semi-structured data, i.e., data whose structure may not match, or may only partially
match, the structure prescribed by the data schema. To represent semi-structured data, the system must
typically extract some features from the multimedia objects. News is particularly interesting for MIR since
many possible sources of indexing information are available: speech, audio, video, and text obtained from
television, radio, and newspapers [5]. Due to the time constraints of continuous daily news coverage and the
volume of data, automatic methods for creating news libraries are required [6]. Figure 1 shows the general
information retrieval system.


[Figure 1 diagram; recoverable labels: Input, Queries, Feedback]

Figure 1. General information retrieval system


The IR community has concentrated on the textual retrieval of mainly English documents.
Content-based retrieval of textual documents is not limited to plain text but has been extended to specialized
kinds of content, such as formulas, tables, handwriting, and special structures like hypertext [7]. Another
line of research in text retrieval is the complementary application of natural language processing (NLP)
techniques. The use of language-specific linguistic knowledge may improve the effectiveness of text
retrieval, i.e., the quality of tasks such as summarization, extraction, filtering, and categorization. NLP
techniques in combination with the structure of textual documents can improve retrieval even further [8].

Interest in audio retrieval has been relatively low compared to the visual media types. Research in
audio retrieval and audio feature extraction is increasing, however, resulting in different audio (speech and
music) retrieval systems. Searching for a certain sound or type of sound (such as the speech of a particular
speaker, or music) can be a daunting task. Words are inadequate to convey the essence of sounds, and there is
no standard for sound classification. No two listeners will produce the same description for every sound.
Humans tend to describe sounds by similar sounds, e.g., a buzzing sound for bees. Feature-based matching
applies where a sound is a single gestalt (short single sounds or longer recordings with a uniform texture,
e.g., rain on a roof) [9].

Because of the confluence of the image processing and database industries, many content-based
image retrieval systems have emerged. The semantic interpretation of visual information is considerably
more complex than that of structured text. In text, each word has a limited number of meanings, and the
precise meaning can be determined through sentence or paragraph analysis. Visual items that share the same
semantic notion can vary greatly in appearance, for instance images of cats in children's books. Search in
image retrieval systems is approximate, which means that those systems use automatically obtained
characteristics of an image to determine visual similarity. The visual properties are derived from a
computation executed on the image object. Simple properties (like shape, color histograms, and texture) are
calculated from pixel characteristics, such as position and color [10].


3. NEURAL NETWORKS IN INFORMATION RETRIEVAL 
Neural networks (NNs) have been applied to IR in a variety of ways. The three main methods are:

— The transformation network has been proposed for the enhancement of queries. It consists of a
back-propagation NN with one or several hidden layers, in which the inputs and outputs are
representation schemes [11].

— The model of cognitive similarity learning in information retrieval (COSIMIR) uses a back-propagation
NN to match the document representation with the query. The document and the query both serve as
inputs to the network, which computes their similarity in the output layer. The similarity is a measure of
the relevance of the document to the query. The training data must be gathered from a large number of
relevance judgements from users. In this way, COSIMIR implements a cognitive similarity function [12].

— Spreading activation models are essentially Hopfield networks, customarily built with nodes for query
terms and nodes for documents, and used to retrieve the most activated documents. The links are
weighted according to the term-document matrix, which is determined with a common indexing
algorithm. Spreading activation models have been applied to large amounts of real-world data and have
achieved satisfactory results compared with the statistical models that dominate IR research and
development [13].


All these models use a term vector, in which a document is characterized by a few of the many
terms that occur in the document collection. The models run into issues that stem from the size of those
sparsely coded vectors. In particular, full-text retrieval results in large vectors that may contain every
word of a natural language. Even manual indexing with a controlled thesaurus typically produces large
vectors [14]. For instance, the thesaurus of the Social Science Information Centre in Bonn has 22,000 entries.
An NN with a similarly large number of nodes needs significant resources [15]. Therefore, dimensionality
reduction is a promising approach for the efficient use of NNs in IR.


4. NEURAL SYSTEM DESCRIPTION 

The NN models considered in this paper are feed-forward networks. Input features are learned or
extracted by the network through several stacked, fully connected layers. Each layer applies an affine
transform to the vector output of the previous layer, followed by the element-wise application of a nonlinear
activation function. Each layer is therefore associated with a matrix of parameters that is estimated during
learning. In IR, the output of the whole network is usually a set of predicted scores or a vector
representation of the input [16]. During training, a loss function is computed by comparing the prediction
with the ground truth available for the training data, and training adjusts the parameters of the network to
minimize the loss [17]. This is usually carried out with the classic backpropagation algorithm. Figure 2
shows the neural network IR model.


[Figure 2 diagram; recoverable labels: Query, Keywords, Documents]

Figure 2. Information retrieval model of a neural network
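
The forward pass and the backpropagation update described above can be sketched as follows. This is a minimal illustration and not the network used in the paper: the single hidden layer, the tanh activation, the squared-error loss, the learning rate, and the names `forward` and `train_step` are all assumptions made for the example.

```python
import numpy as np

def forward(x, params):
    """Forward pass: affine transform, tanh activation, affine output."""
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)      # hidden layer: affine transform + nonlinearity
    y = W2 @ h + b2               # output layer: predicted scores
    return h, y

def train_step(x, target, params, lr=0.1):
    """One backpropagation step on the loss 0.5 * ||y - target||^2."""
    W1, b1, W2, b2 = params
    h, y = forward(x, params)
    loss = 0.5 * np.sum((y - target) ** 2)
    dy = y - target               # gradient of the loss w.r.t. the output
    dW2 = np.outer(dy, h)
    dh = W2.T @ dy                # gradient propagated back to the hidden layer
    dz = dh * (1.0 - h ** 2)      # derivative of tanh
    dW1 = np.outer(dz, x)
    new_params = (W1 - lr * dW1, b1 - lr * dz, W2 - lr * dW2, b2 - lr * dy)
    return loss, new_params

# Assumed toy sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
params = (rng.normal(scale=0.5, size=(4, 3)), np.zeros(4),
          rng.normal(scale=0.5, size=(2, 4)), np.zeros(2))
x = np.array([1.0, 0.5, -0.5])
target = np.array([1.0, 0.0])
losses = []
for _ in range(50):
    loss, params = train_step(x, target, params)
    losses.append(loss)
```

Each call to `train_step` compares the prediction with the target and moves every parameter matrix against its gradient, so the recorded loss should shrink over the iterations.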


The retrieved multimedia are encoded at the input of the first NN [18]. This NN determines whether
a given piece of multimedia in the document is a keyword. If it is, the occurrence is added to the vector
space model matrix [5], which yields a normalized vector space model matrix. The weights of this matrix are
matched with the weights of the second NN, which takes keywords as input and produces the multimedia
base as output. Combining these two NNs gives the information retrieval system illustrated in Figure 3.
Next, an input interface is created that lets the user enter a query, after which the information retrieval
system finds the related multimedia [17]. Finally, an output interface is created, which sorts the relevant
multimedia and sends them to the user as results. When a query contains two or more words, more groups of
input neurons must be added to the first NN, where each group represents multimedia. Each type is then
generated separately, and coherent keywords are created for each query. This structure makes it possible to
give more keywords as input, and also to give the multimedia connections as input to the NN [18].

Figure 3. Neural network for determination of multimedia information


The general steps of the proposed system for a multimedia database are shown in Figure 4 for three
media types: text, image, and audio. The system consists of algorithms with two stages: a training stage,
called the off-line stage, and a testing stage, called the on-line stage. The first technique uses latent
semantic indexing (LSI) and the vector space model (VSM) to retrieve text documents based on the nature of
the query language. The second technique is based on features extracted by curvelet decomposition together
with statistical metrics such as mean, standard deviation, and signal energy. This algorithm is called
content-based image retrieval (CBIR). To retrieve audio, the third technique uses the discrete wavelet
transform (DWT) and signal characteristics [14].
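
As a rough illustration of the statistical part of the image technique, the sketch below computes the mean, standard deviation, and signal energy of each band of coefficients and concatenates them into one feature vector. The curvelet decomposition itself is not shown: `subbands` is a hypothetical stand-in for its output, and taking energy as the sum of squared coefficients is an assumption, since the paper gives no formula.

```python
import math

def band_features(coeffs):
    """Return (mean, standard deviation, energy) for one list of coefficients."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    variance = sum((c - mean) ** 2 for c in coeffs) / n   # population variance
    energy = sum(c * c for c in coeffs)                   # assumed definition
    return mean, math.sqrt(variance), energy

def feature_vector(subbands):
    """Concatenate the three statistics of every sub-band into one vector."""
    features = []
    for band in subbands:
        features.extend(band_features(band))
    return features

# Hypothetical coefficients standing in for two decomposition sub-bands.
subbands = [[1.0, 3.0], [2.0, 2.0, 2.0, 2.0]]
vector = feature_vector(subbands)
```

Here `vector` holds three numbers per sub-band, so a decomposition with more bands simply yields a longer feature vector.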


[Figure 4 flowchart; recoverable steps: Query, Preprocessing, Extract Query Features]

Figure 4. Flowchart of the proposed system


5. VECTOR SPACE MODEL 

The vector space model (VSM) is a standard IR approach in which documents are characterized by the
words they contain. It was introduced by G. Salton in the early 1960s to address some issues of IR [5].
In the vector space model, every document is represented by an N-dimensional term weight vector, with each
element representing the weight of one of the N terms in that document. When a document group has M
documents, the group is represented as a matrix A with M×N dimensions. During the retrieval process, the
query is also represented as an N-dimensional term weight vector. The dot product or cosine coefficient
between the document vector and the query vector is used to compute the similarity between the query and
every stored document. The representation is summarized in Table 1.

Table 1. Vector space model

Term space    Term count
              Doc_1    Doc_2    ...    Doc_n
T_1           a_11     a_12     ...    a_1n
T_2           a_21     a_22     ...    a_2n
...           ...      ...      ...    ...
T_m           a_m1     a_m2     ...    a_mn


The magnitude of the vector for each document is computed with the Pythagorean theorem, generalized
to more than two dimensions, as in (1) to (3).

\[ \|v_1\| = \sqrt{(a_{11})^2 + (a_{12})^2 + \cdots + (a_{1n})^2} \tag{1} \]

\[ \|v_2\| = \sqrt{(a_{21})^2 + (a_{22})^2 + \cdots + (a_{2n})^2} \tag{2} \]

\[ \|v_m\| = \sqrt{(a_{m1})^2 + (a_{m2})^2 + \cdots + (a_{mn})^2} \tag{3} \]

where \|v_i\| is the length of the vector of document i and a_{ij} is the weight of term j in document i.

The inner product of the query vector with each document vector gives the score: for every term, the weight
of the query term is multiplied by the weight of the document term, and the term scores are then summed.
Normalized by the vector lengths, this calculation is referred to as the cosine correlation measure, which
may be computed directly with [13],

\[
\mathrm{COSINE}(DOC_i, QUERY_j) =
\frac{\sum_{k=1}^{t} TERM_{ik} \cdot QTERM_{jk}}
     {\sqrt{\sum_{k=1}^{t} (TERM_{ik})^2 \cdot \sum_{k=1}^{t} (QTERM_{jk})^2}}
\tag{4}
\]

where TERM_{ik} is the weight of term k in document i and QTERM_{jk} is the weight of term k in query j.
When documents and queries are considered as vectors in a multi-dimensional term space of dimension t, the
cosine correlation measures the cosine of the angle between them.
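
Equation (4) can be sketched in a few lines. The example assumes the M×N layout described in the text (one row per document, one column per term); the weights in `A` and `query` are hypothetical, and the helper names `cosine` and `rank` are not from the paper.

```python
import math

def cosine(doc, query):
    """Cosine correlation between two equal-length term weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    return dot / norm if norm else 0.0   # a zero vector matches nothing

def rank(documents, query):
    """Return document indices sorted by decreasing cosine score."""
    scores = [cosine(doc, query) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

# Hypothetical weights: 3 documents (rows) over 3 terms (columns).
A = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [3.0, 0.0, 4.0]]
query = [3.0, 0.0, 4.0]
order = rank(A, query)
```

Because the score depends only on the angle between the vectors, the third document (identical to the query) ranks first, the first document (similar direction, smaller weights) second, and the second document (no shared terms) last.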


Algorithm 1: Steps of vector space model

Input: Text Database (DB).
Output: Documents Features (DF).
Start
  Compute length (DB)
  I = 1
  While (I <= length (DB))
    Compute length (Doc (I))
    J = 1
    While (J <= length (Doc (I)))
      Read Term (J)
      If (Term (J) is not a Stop Word)
        Stem Term (J) using Porter stemmer
        Compute Term Frequency (TF) (the frequency of the term inside the document, usually normalized within that document)
      Else
        Remove Stop Word
      End If
      J++
    End While
    I++
  End While
  Compute Document Frequency (DF)
  Compute Inverse Document Frequency (IDF) (IDF = log (total documents in database / documents containing the term))
End.
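
The steps of Algorithm 1 can be sketched in Python as shown below. Several parts are assumptions made for illustration: the stop-word list is a tiny sample, `stem` is a crude placeholder and not the Porter stemmer, whitespace tokenization is used, and multiplying TF by IDF to obtain the final weight is the conventional choice, which the algorithm itself does not state.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and"}   # illustrative sample only

def stem(term):
    """Placeholder for the Porter stemmer: strips a trailing 's' only."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def tf_idf(database):
    """Return one {term: weight} dictionary per document."""
    term_freqs = []
    for doc in database:
        # Drop stop words, then stem the remaining terms.
        terms = [stem(t) for t in doc.lower().split() if t not in STOP_WORDS]
        counts = Counter(terms)
        total = len(terms) or 1
        term_freqs.append({t: c / total for t, c in counts.items()})  # normalized TF
    doc_freq = Counter(t for tf in term_freqs for t in tf)   # documents per term
    n_docs = len(database)
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return [{t: w * idf[t] for t, w in tf.items()} for tf in term_freqs]

# Hypothetical three-document database.
database = ["the cats chase mice", "a dog chases cats", "mice fear the dog"]
weights = tf_idf(database)
```

A term that appears in only one document, such as "fear" here, receives the largest IDF and therefore the largest weight in that document.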


6. LATENT SEMANTIC INDEXING 

Latent semantic analysis (LSA), or LSI, is a theory and method for extracting and representing the
contextual-usage meaning of words through statistical computations applied to a large corpus of text [18].
The first step of LSI is to represent the text as a matrix in which every row stands for a distinct word and
every column stands for a text passage or other context. Next, each cell frequency is weighted by a function
that expresses both the importance of the word in the particular passage and the degree to which the word
type carries information. LSA then applies singular value decomposition (SVD) to the matrix. In SVD, a
rectangular matrix is decomposed into the product of three other matrices. One component matrix describes
the original row entities as vectors of derived orthogonal factor values, another describes the original column
entities in the same way, and the third is a diagonal matrix containing scaling values such that when the three
components are matrix-multiplied, the original matrix is reconstructed. The SVD theorem for any real matrix
A with dimensions m×n can be expressed as in (5) and is illustrated in Figure 5 [19],

\[ A = U S V^{T} \tag{5} \]

where U (m×r) is a column-orthonormal matrix (U^T U = I), r is the rank of A, S (r×r) is a diagonal matrix,
I is the identity matrix, and V^T (r×n) is the transpose of a column-orthonormal matrix V (V^T V = I). When
the elements of S are arranged in descending order, the singular values are uniquely determined. In the
context of text document retrieval, the rank r of A equals the number of concepts. For a matrix A whose rows
are documents and whose columns are terms, U is the document-to-concept similarity matrix, whereas V is the
term-to-concept similarity matrix; with the transposed layout, the roles of U and V are exchanged. For
instance, U_{2,3} = 0.6 indicates that concept 3 has weight 0.6 in document 2, and V_{1,2} = 0.4 indicates
that the similarity between concept 2 and term 1 is 0.4. The strengths of LSI are that it has a strong formal
framework, is completely automatic with no need for stemming, can be used for multilingual search, and
improves recall in conceptual IR. Its weaknesses are that it is expensive to compute, that it relies on a
continuous, normal-distribution-based approach that is not well suited to count data, and that improving
accuracy is often of higher importance, which requires query and word sense disambiguation [19].


[Figure 5 diagram: A (m×n) = U (m×m) S (m×n) V' (n×n), with truncated factors U_k (m×k), S_k (k×k), V_k' (k×n); labels: Term Vectors, Document Vectors]

Figure 5. Latent semantic indexing
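
The decomposition in (5) and its rank-k truncation can be tried with NumPy's `numpy.linalg.svd`. The sketch below is only an illustration: the counts in `A` are hypothetical, the matrix is laid out with documents as rows and terms as columns, and the helper name `lsi` and the choice k = 2 are assumptions.

```python
import numpy as np

def lsi(A, k):
    """Truncated SVD: keep the k largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted descending
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical document-by-term counts: rows are documents, columns are terms.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0],
              [0.0, 0.0, 3.0, 1.0]])

Uk, sk, Vk_t = lsi(A, k=2)
A_k = Uk @ np.diag(sk) @ Vk_t        # rank-2 approximation of A
doc_concepts = Uk * sk               # each document in the 2-concept space
query = np.array([0.0, 0.0, 1.0, 1.0])
query_concepts = Vk_t @ query        # the query projected onto the same concepts
```

Documents and queries can then be compared in the k-dimensional concept space, for example with the cosine measure of Section 5, instead of the full term space.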


7. DISCRETE WAVELET TRANSFORM 

The wavelet transform is an important computational tool for many applications of signal and image
processing. For instance, it is useful for compressing digital media files (images or audio). For a 1D signal,
at every level the signal is decomposed into two frequency sub-bands, where L represents the low
frequencies and H represents the high frequencies [20], as illustrated in Figure 6. The calculation of the
wavelet transform for a two-dimensional signal involves sub-sampling and recursive filtering. At every
level, the signal is divided into four frequency sub-bands: low frequency-high frequency (LH), high
frequency-low frequency (HL), high frequency-high frequency (HH), and low frequency-low frequency
(LL) [18], [21].


[Figure 6 diagram; recoverable labels: Original image, DWT, LL1, HL1, LL2, HL2, LH2, HH2]

Figure 6. 1D DWT with different levels
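
The 1D decomposition into a low band and a high band can be sketched with the Haar wavelet. The paper does not name the wavelet it uses, so the Haar filters, the sample values, and the function names below are assumptions chosen to keep the example short.

```python
import math

def haar_dwt(signal):
    """One level of the 1D Haar DWT: returns the (low, high) sub-bands."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high

def haar_dwt_levels(signal, levels):
    """Repeatedly decompose the low band; returns [L_n, H_n, ..., H_1]."""
    bands = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_dwt(low)
        bands.insert(0, high)        # coarser high bands go to the front
    return [low] + bands

# Hypothetical 8-sample signal decomposed over two levels.
signal = [4.0, 4.0, 2.0, 0.0, 1.0, 1.0, 5.0, 3.0]
low2, high2, high1 = haar_dwt_levels(signal, levels=2)
```

Each level halves the length of the low band, and with this orthonormal scaling the total energy of the sub-bands equals the energy of the original signal.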


Multimedia information retrieval using artificial neural network (Maha Mahmood) 


Feature extraction requires the voice signal to be free of noise. The DWT is therefore used here so
that the features extracted per frame of a voice command have values within the range of the class to which
the voice command belongs. The audio features computed are zero crossing rate (ZCR), spectral flux (SF),
spectral roll-off (SR), spectral centroid (SC), energy (E), and energy entropy (EE). The standard deviation
(SD) of each of these features is then computed to create the feature vector [22]-[25].
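
The construction of the feature vector can be illustrated for two of the listed features, ZCR and energy; the spectral features and energy entropy are left out of this sketch. The non-overlapping framing, the frame size, the definition of energy as mean squared amplitude, and all function names are assumptions, since the paper does not specify them.

```python
import math

def frames(signal, size):
    """Split a signal into consecutive non-overlapping frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def energy(frame):
    """Mean squared amplitude of the frame (assumed definition)."""
    return sum(x * x for x in frame) / len(frame)

def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def audio_feature_vector(signal, size):
    """Standard deviation of each per-frame feature across all frames."""
    fs = frames(signal, size)
    return [std([zero_crossing_rate(f) for f in fs]),
            std([energy(f) for f in fs])]

# Hypothetical samples: an alternating frame followed by a constant frame.
signal = [1.0, -1.0, 1.0, -1.0, 2.0, 2.0, 2.0, 2.0]
vector = audio_feature_vector(signal, size=4)
```

The resulting vector has one entry per feature, so adding the remaining features would extend it to six entries.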


8. RESULTS 

This paper implements an MIR system using three algorithms in two phases (training and
testing). The first algorithm uses two models (VSM and LSI) to retrieve text documents based on the
nature of the query language. The second is based on features obtained by curvelet decomposition and on
statistical parameters such as mean, standard deviation, and signal energy. This algorithm is called CBIR.
The third uses DWT and signal characteristics to retrieve audio signals. The model is tested on 1,000
selected images. The tiny images dataset on which all the experiments are based was gathered by colleagues
at New York University (NYU) and Massachusetts Institute of Technology (MIT) over 6 months. The selected
images fall into 10 categories of 100 images each, as depicted in Figure 7.

Feature extraction is a fundamental component of MIR systems. It occurs during the off-line
pre-processing phase and again when users submit an image query to the system. The goal of feature
extraction is to identify a set of features that can be used to describe each image. In that step, the features
are obtained from the image data and are then compared with the features of the query. After matching, the
results are ranked by similarity level. Efficiency is evaluated with two metrics, precision and recall. The
training phase results show that the curvelet-based neural network has a greater elapsed time than the
neural networks based on DWT or histogram, as shown in Figure 8.

Figure 7. Image categories

Figure 8. Proposed system training time

In the online stage, multimedia items were chosen arbitrarily from every set in the image, text, and
audio databases and then tested by the CBIR system based on curvelet, histogram, and wavelet features.
Compared with CBIR using wavelet and histogram features, the results demonstrate that the neural network
system has a shorter search time when using curvelet features. The precision curves are shown in Figure 9.

Figure 9. The precision curve based on histogram, wavelet and curvelet
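
The two evaluation metrics can be computed as in the following sketch. Precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. The item identifiers, the ground-truth set, and the cutoff values are hypothetical and are not results from the paper.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the relevant items."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_cutoffs(ranked, relevant, cutoffs):
    """Precision after the top-k results, for each k in cutoffs."""
    return [precision_recall(ranked[:k], relevant)[0] for k in cutoffs]

# Hypothetical ranked result list and ground-truth set for one query.
ranked = ["img3", "img7", "img1", "img9"]
relevant = {"img3", "img1", "img5"}
p, r = precision_recall(ranked, relevant)
curve = precision_at_cutoffs(ranked, relevant, [1, 2, 4])
```

Evaluating precision at increasing cutoffs gives the points of a precision curve for one query; averaging over many queries would give a curve per feature type.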


9. CONCLUSION 

This work presents a neural network for the retrieval of multimedia information. Classification
information is important for reducing the search time. This becomes apparent when the user searches a
clustered database, where the proposed system gives good results with a low retrieval time. A neural
network is modelled and trained on a collection of different multimedia. The learned features are used to
present an effective multimedia retrieval system that operates on a large set of multi-modal datasets. Future
work will investigate more sophisticated deep learning approaches and assess additional datasets for more
in-depth experimental research, in order to provide more knowledge for reducing the semantic gap in
multimedia information retrieval in the long term. This field of study uses enhanced statistical analysis and
machine learning approaches to uncover patterns and correlations in multimedia information and
conventional databases that classical approaches might not be able to discover.